library(rvest)
library(tidyverse)
library(sf)
library(rnaturalearth)
library(gganimate)
This lesson covers web scraping with rvest, cleaning the scraped tables into a parish-year panel, and visualizing trends with ggplot2, sf, and gganimate.
We will be investigating police killings across Jamaica over time. A 2001 Amnesty International report explained that “the loss of life at the hands of the Jamaican Constabulary Force (JCF) borders on a human rights emergency. The rate of lethal police shootings in Jamaica is one of the highest per capita in the world.” A subsequent 2016 report found that INDECOM had established new systems for police accountability since its creation in 2010, with substantial reductions in police killings. However, “although the number of killings by police has fallen in the past two years, the way the police operate and unlawfully kill remains largely unchanged. The vast majority of victims are young men and teenagers.”
In this lesson, we will examine how police killings have evolved over time in each of Jamaica’s 14 parishes, beginning in 2017 when the information is first available online. We will be scraping data from Jamaica’s Independent Commission of Investigations (INDECOM), a “civilian staffed state agency tasked to undertake investigations concerning actions by members of the Security Forces and other Agents of the State that result in death or injury to persons.”
For a more complete introduction to web scraping with R using the rvest package, I recommend Chapter 24 of Hadley Wickham’s R for Data Science. Here we will focus on getting rvest to work for the task at hand.
Let’s take a look at one of the URLs we will be scraping: https://www.indecom.gov.jm/report/2017-security-force-related-fatalities. We need to understand a bit about HTML, the language that describes web pages. In your browser, you can right click (or equivalent) near the top of the data table and select Inspect to view the source HTML code for the table:
This will show you a list of HTML elements. Before we proceed with the scraping, let’s review a little bit about HTML. HTML stands for HyperText Markup Language and looks something like this:
<html>
<head>
<title>Page title</title>
</head>
<body>
<h1 id='first'>A heading</h1>
<p>Some text & <b>some bold text.</b></p>
<img src='myimg.png' width='100' height='100'>
</body>
</html>
As Hadley Wickham explains, HTML has a hierarchical structure formed by elements which consist of a start tag (e.g., <tag>), optional attributes (id='first'), an end tag (like </tag>), and contents (everything in between the start and end tag).
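As a quick aside (not part of the lesson's pipeline), we can see this hierarchy in action by parsing a snippet like the one above with rvest's minimal_html() and pulling out pieces by tag:

```r
library(rvest)

# build a small, valid HTML document from a snippet (illustrative)
html <- minimal_html("
  <h1 id='first'>A heading</h1>
  <p>Some text & <b>some bold text.</b></p>
")

html %>% html_element("h1") %>% html_text()     # contents: "A heading"
html %>% html_element("h1") %>% html_attr("id") # attribute: "first"
```

The same html_element()/html_text() pattern works on full pages read with read_html().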
First, we need the URL of the page to scrape. Next we’ll need to read the HTML for that page into R using read_html(). This returns an xml_document object that we can manipulate using rvest functions.
url <- "https://www.indecom.gov.jm/report/2017-security-force-related-fatalities"
webpage <- read_html(url)
webpage
## {html_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="report-template-default single single-report postid-609 grou ...
The xml_document object includes all of the HTML; our task is to find the HTML element that contains the information we're looking for. In this case we're lucky: the data is already stored in an HTML table! When you inspect the source code for the table, you should be able to see that the information we're looking for is nested within a <table> element. Note the <table> tag in the screenshot below:
rvest includes a function that knows how to read data from a <table> element: html_table(). This function returns a list containing one data frame for each table found on the page.
We’ve already read the HTML into R as an object called webpage. Next we’ll use html_table() to extract the contents of the data table we’re interested in.
tables <- webpage %>% html_table() # extract HTML tables from the full HTML
length(tables) # check how many tables were extracted
## [1] 1
str(tables) # shows that tables is a list of data frames
## List of 1
## $ : tibble [174 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ X1: chr [1:174] "Total Shooting Fatalities" "1" "2,3" "4" ...
## ..$ X2: chr [1:174] "Date of Indicent" "1-Jan" "1-Jan" "3-Jan" ...
## ..$ X3: chr [1:174] "Name of Deceased" "Tavoy CHRISTIE" "Dabian CAMPBELL, Martin ROYES" "Chrishawn C. DRYSDALE" ...
## ..$ X4: chr [1:174] "Location of Incident" "Gordon Crescent, Granville, St James" "Yacca Ave, Kingston" "Hudson Ave, Kingston 11" ...
## ..$ X5: chr [1:174] "Related State Agency" "JCF" "JCF" "JCF" ...
tables[[1]] # view the first (and only) table
This works: we now have the basic syntax for reading HTML tables from webpages! But we have some data cleaning to do. Before we start cleaning, let's build the complete list of URLs to scrape so we can extract the data tables for all 8 years.
Note that the URLs found on the INDECOM website follow different structures from year to year:
Let’s start by keeping the URL for 2017 and using it to populate a data frame, urldf, with a single column named ulist. We can then loop through the remaining years to create the correct URL text for each year, noting that the URLs for 2021-2024 follow a different structure than those for 2018-2020.
# create data frame with one column, ulist; use 2017 URL to fill the first row
urldf <- data.frame(ulist = url)
# loop through remaining years to create the URL text for each year
for (year in 2018:2024){
if (year >= 2018 & year < 2021){
auxi_url <- paste0("https://www.indecom.gov.jm/report/",year,"-security-forces-related-fatalities")
}
if (year >= 2021){
auxi_url <- paste0("https://www.indecom.gov.jm/report/",year,"-security-forces-fatal-shootings")
}
urldf <- urldf %>% bind_rows(data.frame(ulist = auxi_url))
}
# inspect
urldf
dim(urldf)
## [1] 8 1
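As an aside, the same eight URLs can be built without a loop using vectorized string operations. A sketch that should reproduce urldf (note the 2017 URL uses 'force-related-fatalities' while 2018-2020 use 'forces-related-fatalities'):

```r
years <- 2017:2024
stem <- ifelse(years == 2017, "-security-force-related-fatalities",
        ifelse(years <= 2020, "-security-forces-related-fatalities",
                              "-security-forces-fatal-shootings"))
urldf_alt <- data.frame(
  ulist = paste0("https://www.indecom.gov.jm/report/", years, stem)
)
```

Either approach is fine here; the loop version makes the year-by-year logic more explicit.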
We’ve already seen how to read a page with read_html() and extract its table with html_table(). Next let’s convert the table into a data frame and take a glimpse.
webpage1 <- read_html(urldf[1,1]) %>% # test by reading in first url from urldf
html_table() %>%
as.data.frame()
head(webpage1)
When we convert the HTML table to a data frame, we can see that the original column names appear in the first row. We can use the first row to set the correct column names for this data frame, and then remove the first row.
new_names <- as.character(webpage1[1, ]) # assign values in 1st row to a char vector
colnames(webpage1) <- new_names # use this char vector to set column names
webpage1 <- webpage1[-1, ] # remove the 1st row
Next, let’s build a function that does all of this for every URL in urldf, and stores each table as a data frame.
The tricky part is identifying and handling the idiosyncrasies of the tables from year to year. In other words, the tables on the webpages for each year can look a bit different. Specifically, we can see that the 2023 table has an extra column at the beginning of the table.
In all other years, the first column is a counter for the number of deaths at the hands of police, i.e. ‘Total Fatals’. But in 2023, it’s the second column that counts the number of deaths. It turns out that this extra column is unnamed in the original table, which means that when we use the first row to set column names, the extra column remains unnamed. This causes problems when working with the data frame, as every column needs an appropriate name. To address it, we’ll use an if statement to identify the 2023 table, rename its ‘Total Fatals’ column as ‘input_id’, and rename the unnamed column as ‘empty_name’. For the other years, we’ll simply rename the ‘Total Fatals’ column as ‘input_id’.
To summarize, we’ll create a function that, for each of the eight yearly URLs stored in urldf, reads the HTML, extracts the table as a data frame, sets the column names from the first row, handles the extra 2023 column, and records the year:
# Generates a function to obtain the data:
extract_fun <- function(num){
# read HTML
webpage <- read_html(urldf$ulist[num])
# extract table and convert to data frame
extract_df <- webpage %>%
html_table() %>%
as.data.frame()
# note that the var names in each df appear in the first row
# extract var names, use to set column names, and then remove 1st row
new_names <- as.character(extract_df[1, ])
colnames(extract_df) <- new_names
extract_df <- extract_df[-1, ] # remove the first row
# using an if statement, rename the death counter column as 'input_id'
if (num != 7){
names(extract_df)[1] <- "input_id"
}
if (num == 7){
names(extract_df)[1] <- "empty_name" # first column in 2023 is empty
names(extract_df)[2] <- "input_id"
}
# create a column indicating the year
extract_df <- extract_df %>% mutate(year = num + 2016)
# The first table is 2017=1+2016, the second is 2018=2+2016, etc.
return(extract_df) # output df
}
This function will extract data from an HTML table and store it as a data frame called extract_df. Next, let’s create a list with 8 empty elements, df_list, to store the data frame for each scraped table. We’ll then loop from 1 to 8, calling extract_fun() at each step and assigning the output to the corresponding element of df_list. This way, we’ll end up with a list of 8 data frames, one for each yearly table. We’ll also name each element (data frame) by the corresponding year.
# create an empty list to store the results
df_list <- vector("list", 8)
# loop through 1 to 8 and apply extract_fun
for (i in 1:8) {
df_list[[i]] <- extract_fun(i)
}
# define the vector of years to use for df names
year_names <- as.character(2017:2024)
# assign table names
names(df_list) <- year_names
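The loop-and-assign pattern can also be collapsed into a single lapply() call, which returns the list directly; a sketch:

```r
# apply extract_fun to 1..8 and name the results by year in one pass
df_list <- setNames(lapply(1:8, extract_fun), as.character(2017:2024))
```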
We now have a list (df_list) with a separate data frame for each year. We eventually want to append these data frames so that all of the information is in a single data frame. In order to do that, we need a consistent set of column names in each data frame. We can’t assume the underlying tables used the same column names, so we need to check!
for (num in 1:8) {
print(df_list[[num]]$year[1])
print(names(df_list[[num]]))
}
## [1] 2017
## [1] "input_id" "Date of Indicent" "Name of Deceased"
## [4] "Location of Incident" "Related State Agency" "year"
## [1] 2018
## [1] "input_id" "Date of Incident" "Name of Deceased"
## [4] "Location of Incident" "Related State Agent" "NA"
## [7] "year"
## [1] 2019
## [1] "input_id" "Date of Incident" "Name of Deceased"
## [4] "Location of Incident" "Related State Agent" "year"
## [1] 2020
## [1] "input_id" "Date of Incident" "Name of Deceased"
## [4] "Location of Incident" "Related State Agent" "year"
## [1] 2021
## [1] "input_id" "Date of Incident" "Name of Deceased"
## [4] "Location of Incident" "Related State Agent" "year"
## [1] 2022
## [1] "input_id" "Date of Incident" "Name of Deceased"
## [4] "Location of Incident" "Related State Agency" "Mental Health Status"
## [7] "year"
## [1] 2023
## [1] "empty_name" "input_id" "Date"
## [4] "Deceased" "Location" "Security Force Unit"
## [7] "Mental Health Status" "year"
## [1] 2024
## [1] "input_id" "Date" "Name of Deceased"
## [4] "Location of Incident" "Related State Agency" "Weapon Recovery"
## [7] "Mental Health" "year"
If we look closely, we can see a number of inconsistencies. Let’s do some manual recoding using base R syntax. Alternatively, you could use tidyverse syntax and the case_when() function, for example.
names(df_list[["2023"]])[4] <- "Name of Deceased" # was Deceased
names(df_list[["2023"]])[5] <- "Location of Incident"# was Location
names(df_list[["2017"]])[2] <- "Date of Incident" # was Date of Indicent not Date of Incident
names(df_list[["2023"]])[3] <- "Date of Incident" # was Date
names(df_list[["2024"]])[2] <- "Date of Incident" # was Date
names(df_list[["2017"]])[5] <- "Related State Agent" # was Related State Agency
names(df_list[["2021"]])[5] <- "Related State Agent" # was Related State Agency
names(df_list[["2022"]])[5] <- "Related State Agent" # was Related State Agency
names(df_list[["2023"]])[6] <- "Related State Agent" # was Security Force Unit
names(df_list[["2024"]])[5] <- "Related State Agent" # was Related State Agency
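If the list of fixes grows, positional renaming gets fragile. As a sketch of an alternative, the same old-name/new-name pairs can be stored in a lookup vector and applied to every data frame at once with dplyr::recode() (names not listed are left unchanged):

```r
# lookup vector: old name = new name (mirrors the manual fixes above)
fixes <- c("Date of Indicent"     = "Date of Incident",
           "Date"                 = "Date of Incident",
           "Deceased"             = "Name of Deceased",
           "Location"             = "Location of Incident",
           "Related State Agency" = "Related State Agent",
           "Security Force Unit"  = "Related State Agent")
df_list <- lapply(df_list, function(df) {
  names(df) <- dplyr::recode(names(df), !!!fixes)
  df
})
```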
Now that we have a consistent set of column names, let’s bind (or append) all of the data frames into a single data frame using the bind_rows() function. We’ll also rename the columns.
# append data frames together
killings <- bind_rows(df_list)
# Renaming columns
killings <- killings %>%
rename(name_deceased = `Name of Deceased`,
date_of_incident = `Date of Incident`,
location = `Location of Incident`,
state_agent = `Related State Agent`)
We’ve cleaned up the column names; now let’s check whether there are any unnecessary rows. There are some blank rows, as well as some rows with sub-headers for each month:
One way to remove the sub-header rows is to search for the ‘:’ character in the state_agent column and drop the rows where there is a match. There are also many blank entries in that column, corresponding to subtotals or spacer rows in the original tables. Lastly, let’s select only the columns we need to count the number of police killings.
# Drop the sub-header rows by searching for the ':' character
killings <- killings %>% filter(!grepl(":", state_agent))
# Drop repeated header rows that contain "Deceased" in the name column
killings <- killings %>% filter(!grepl("Deceased", name_deceased))
# Filter out remaining empty rows by searching for empty name_deceased
# Note: some rows lack a value for the input_id column but still record a killing, so we don't want to discard these
killings <- killings %>% filter(name_deceased != "")
# Let's keep only the columns of interest
killings <- killings %>%
dplyr::select(input_id, date_of_incident, name_deceased, location, state_agent, year)
We now have a tidy data frame where each row records a single incident, which may involve one or more victims. But we need to count the number of police killings for every day, identify the parish, and construct a parish-year panel data set. We’ll need to prepare some intermediate variables before reshaping the data into a long-form panel data set.
Let’s generate a variable with the number of deceased for every record. Many of the rows have a single integer input_id and represent one killing, but we need to carefully check for different situations. First, let’s look at the rows with a blank input_id. We can see that they all represent one death:
filtered_killings <- killings %>%
filter(is.na(input_id) | input_id == "") %>%
select(input_id, name_deceased)
filtered_killings
Next, let’s investigate rows that include multiple victims:
We can see that some rows include multiple killings separated by commas; in other years, the multiple killings are separated by dashes rather than commas. We want to calculate the total number of killings for each row. What logic should we implement? We could search for the first non-numeric character, e.g. a comma, dash, or space, and take the digits to its left. For row 5 in the screenshot above, that leaves us with the number 6. Then we could search for the last non-numeric character and take the digits to its right, which leaves us with the number 11. Then we take the difference and add 1: 11 - 6 + 1 = 6 killings. To implement this logic, we can use the str_extract() function along with the appropriate regular expressions. Lastly, for rows where input_id is blank, we’ll set the number of killings equal to 1.
killings <- killings %>%
mutate(
# Extract the first number (before the first comma, space, or dash)
first_num = as.numeric(str_extract(input_id, "^\\d+")),
# Extract the last number (after the last comma, space, or dash)
last_num = as.numeric(str_extract(input_id, "\\d+$")),
# Regex tokens:
# ^ Match the start of a string
# \ Escape a special character
# \d Matches any digit
# + Match the preceding item 1 or more times
# $ Match the end of a string
# regex cheatsheet: https://www.keycdn.com/support/regex-cheat-sheet
# Compute deceased count
deceased = case_when(
# if input_id is empty, assign deceased = 1
is.na(input_id) | input_id == "" ~ 1,
# otherwise construct the difference as per our recoding logic
!is.na(first_num) & !is.na(last_num) ~ last_num - first_num + 1,
)
) %>%
select(-first_num, -last_num)
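Before trusting the new deceased column, it's worth sanity-checking the two regular expressions on a few representative input_id values (the values below are illustrative):

```r
library(stringr)

ids <- c("4", "2,3", "6,7,8,9,10,11", "14-15")
str_extract(ids, "^\\d+") # leading digits:  "4" "2" "6"  "14"
str_extract(ids, "\\d+$") # trailing digits: "4" "3" "11" "15"
```

For "6,7,8,9,10,11" this yields 11 - 6 + 1 = 6 killings, matching the logic described above.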
We also want to construct a date column; this will allow us to analyze trends at, say, the monthly or quarterly level rather than annually. What logic should we implement? The lubridate package includes functions for converting character strings into dates, but first we need to generate a column with a recognizable character string that includes the day, month, and year. The date_of_incident column includes the day and month, but not the year. So let’s use the str_c() function to combine the date_of_incident and year columns.
killings <- killings %>%
mutate(date = str_c(date_of_incident, "-", as.character(year))) %>%
mutate(date = dmy(date)) %>% # use dmy() from lubridate to recognize day-month-year format
select(-date_of_incident) # drop date_of_incident as redundant
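A quick check that the combined strings parse as intended (illustrative values):

```r
library(stringr)
library(lubridate)

str_c("1-Jan", "-", "2017") # "1-Jan-2017"
dmy("1-Jan-2017")           # Date: "2017-01-01"
```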
There are 14 parishes in Jamaica, which are the main local government jurisdictions. The police force is national, known as the Jamaica Constabulary Force (JCF). The killings included here are coded by the agency responsible: either the JCF, the JDF (Jamaica Defence Force), or the DCS (the Department of Correctional Services).
What logic should we use to identify the parish where each killing occurred? Let’s look in the location column for a string match with one of the parish names. To do this, we need to eliminate any spelling mistakes and capitalization inconsistencies. Let’s convert everything to lower case to avoid capitalization inconsistencies. We’ll also need to manually correct some spelling mistakes.
# Standardize format of location of incident by turning everything to lower case
killings <- killings %>% mutate(location = tolower(location))
# Fix wrong addresses:
killings$location <- gsub("cathrine", "catherine", killings$location)
killings$location <- gsub("catheirne", "catherine", killings$location)
killings$location <- gsub("clerendon", "clarendon", killings$location)
killings$location <- gsub("claredndon", "clarendon", killings$location)
killings$location <- gsub("westmotreland", "westmoreland", killings$location)
killings$location <- gsub("knigston", "kingston", killings$location)
killings$location <- gsub("kingaston", "kingston", killings$location)
killings$location <- gsub(".", " ", killings$location, fixed = TRUE) # replace '.' with a space, e.g. 'st.' becomes 'st '
killings$location <- gsub("\\s+"," ", killings$location) #remove duplicate spaces
Next let’s create a character string that includes all of the parish names in lower case. Then we can use the str_extract() function to extract matched text in the location column, i.e. the matched parish name.
# paste(..., collapse = "|") creates a pattern that matches any of the parish names (i.e., "kingston|andrew|thomas|...")
parishes <- paste(c("kingston", "st andrew", "st thomas", "portland",
"st mary", "st ann", "trelawny", "st james", "hanover",
"westmoreland", "st elizabeth", "manchester",
"clarendon", "st catherine"), collapse = "|")
# extract the parish from the location column and assign to parish column
killings <- killings %>%
mutate(parish = str_extract(location, parishes))
killings %>% select(location, parish) %>% head() # glimpse results
summary(as.factor(killings$parish)) # show distribution by parish (pooled over all years)
## clarendon hanover kingston manchester portland st andrew
## 70 32 285 25 14 34
## st ann st catherine st elizabeth st james st mary st thomas
## 31 209 35 91 36 26
## trelawny westmoreland NA's
## 18 61 21
Success! We have created a new column that identifies the parish where each killing occurred. The 21 NA’s correspond to locations that didn’t match any parish name.
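It's also worth auditing the unmatched locations, to see whether any could be rescued with additional spelling fixes:

```r
# inspect every location where no parish name matched
killings %>%
  filter(is.na(parish)) %>%
  pull(location)
```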
Now we are ready to create a tidy, long-form panel data set with parish-year as the unit of analysis. Let’s start by looking at a cross-tab between parish and year.
table(killings$parish, killings$year, useNA = "always")
##
## 2017 2018 2019 2020 2021 2022 2023 2024 <NA>
## clarendon 10 9 5 9 6 7 10 14 0
## hanover 6 6 3 1 3 1 1 11 0
## kingston 53 37 24 26 43 36 34 32 0
## manchester 1 1 2 5 1 2 3 10 0
## portland 2 2 0 2 3 2 2 1 0
## st andrew 1 4 2 6 8 7 3 3 0
## st ann 3 6 4 4 2 4 3 5 0
## st catherine 24 18 16 19 26 19 38 49 0
## st elizabeth 3 3 1 2 5 9 9 3 0
## st james 21 7 12 10 4 10 16 11 0
## st mary 6 6 3 5 3 4 5 4 0
## st thomas 3 2 2 3 3 3 5 5 0
## trelawny 1 3 1 3 1 3 2 4 0
## westmoreland 10 10 1 4 6 8 12 10 0
## <NA> 17 3 0 0 0 0 0 1 0
This looks like what we want! We can convert the above output to a data frame and make sure the new columns have the right names and storage types. This will give us a long-form panel data set. We may also want to exclude the NAs, depending on the analysis we want to do. The killings with parish as NA are still killings that should be included in countrywide totals. But here we will focus on parish-level trends, so we will exclude observations whose recorded locations don’t match any of the parish names.
panel <- data.frame(table(killings$parish, killings$year, useNA = "no")) %>%
rename(parish = Var1,
year = Var2,
killings = Freq) %>%
mutate(year = as.integer(as.character(year))) %>%
arrange(parish, year)
panel
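As a tidyverse alternative to table(), count() plus tidyr's complete() builds the same panel, with explicit zeros for parish-years that recorded no killings; a sketch:

```r
panel_alt <- killings %>%
  filter(!is.na(parish)) %>%
  count(parish, year, name = "killings") %>% # counts rows, matching table() above
  complete(parish, year, fill = list(killings = 0)) %>%
  arrange(parish, year)
# note: count(..., wt = deceased) would instead weight multi-victim
# rows by the number of deceased
```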
Let’s check that our long-form data frame is balanced, i.e. one observation per parish per year:
table(panel$parish, panel$year)
##
## 2017 2018 2019 2020 2021 2022 2023 2024
## clarendon 1 1 1 1 1 1 1 1
## hanover 1 1 1 1 1 1 1 1
## kingston 1 1 1 1 1 1 1 1
## manchester 1 1 1 1 1 1 1 1
## portland 1 1 1 1 1 1 1 1
## st andrew 1 1 1 1 1 1 1 1
## st ann 1 1 1 1 1 1 1 1
## st catherine 1 1 1 1 1 1 1 1
## st elizabeth 1 1 1 1 1 1 1 1
## st james 1 1 1 1 1 1 1 1
## st mary 1 1 1 1 1 1 1 1
## st thomas 1 1 1 1 1 1 1 1
## trelawny 1 1 1 1 1 1 1 1
## westmoreland 1 1 1 1 1 1 1 1
Indeed it is!
Next let’s load and join in information on the population for each parish. Then we can calculate the number of killings per 10,000 population.
# load and prepare data with population by parish
pop_raw <- read.csv("DZ-2025-04-08-17-50-38.csv")
pop <- pop_raw %>% filter(Series.title != "Total") %>%
mutate(Series.title = tolower(Series.title)) %>%
dplyr::select(Series.title, X1.1.2019) %>%
rename("pop" = "X1.1.2019",
"parish" = "Series.title")
# generate killings per 10,000 pop
panel_pop <- panel %>%
inner_join(pop, by = c("parish")) %>%
mutate(killings_pc = round(killings / (pop/10000), 2)) %>%
select(-pop)
Let’s plot the trend in killings per 10,000 population for each parish.
panel_pop %>%
ggplot(aes(x = year, y = killings_pc, color = parish)) +
geom_line() +
labs(title = '',
x = '',
y = 'killings per 10,000 population')
If we squint we can make out some parishes that experienced bigger increases, but the scaling makes it hard to interpret: the capital, Kingston, has a much higher killing rate than the other 13 parishes, which are compressed into the bottom of the plot space. How can we use the plot space more efficiently? One approach is to rescale each value as a percentage of the value in the initial year, 2017.
First, let’s reshape the data frame from long form to wide form using pivot_wider(), as this will make it easier to create variables based on changes over time:
panel_wide <- panel_pop %>%
select(-killings) %>%
pivot_wider(names_from = year, # obtain new col names from year col
names_prefix = "kpc", # add the prefix kpc to each new col name
values_from = killings_pc) # obtain values for new columns
panel_wide
Now it’s easy to compute the percentage change using this wide-form data frame. Let’s do that, convert back to long form, join the new data into panel_pop, and then plot the new time series.
panel_index <- panel_wide %>%
mutate(killings_pc_index_2017 = 100,
killings_pc_index_2018 = round(100*kpc2018/kpc2017, 2),
killings_pc_index_2019 = round(100*kpc2019/kpc2017, 2),
killings_pc_index_2020 = round(100*kpc2020/kpc2017, 2),
killings_pc_index_2021 = round(100*kpc2021/kpc2017, 2),
killings_pc_index_2022 = round(100*kpc2022/kpc2017, 2),
killings_pc_index_2023 = round(100*kpc2023/kpc2017, 2),
killings_pc_index_2024 = round(100*kpc2024/kpc2017, 2)) %>%
select(-(kpc2017:kpc2024)) %>% # drop original cols with killings per 10k
pivot_longer(cols = c(killings_pc_index_2017:killings_pc_index_2024), # reshape back to long form
names_to = "year",
values_to = "kpci") %>%
mutate(year = as.integer(str_sub(year, 19, 22))) # obtain year from last 4 chars of each string
# join panel_index to panel_pop so all vars are in one data frame
panel_full <- panel_pop %>%
inner_join(panel_index, by = c("year", "parish"))
# plot new time series
panel_full %>%
ggplot(aes(x = year, y = kpci, color = parish)) +
geom_line() +
labs(title = '',
x = '',
y = 'killings as a percent of 2017 levels')
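As a design note, the round trip through wide form can be avoided entirely: grouping by parish and dividing by each parish's 2017 value computes the same index in place (this relies on the panel being balanced, which we verified above). A sketch:

```r
panel_index_alt <- panel_pop %>%
  group_by(parish) %>%
  mutate(kpci = round(100 * killings_pc / killings_pc[year == 2017], 2)) %>%
  ungroup()
```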
This plot is still a bit too cluttered, though we can now make out that the parish of Manchester experienced the largest proportional increase over the period under study.
In this case, a better approach might just be to show a separate plot for each parish by faceting.
panel_full %>%
ggplot(aes(x = year, y = killings_pc, color = parish)) +
geom_line(show.legend = FALSE) +
facet_grid(~parish) +
labs(title = 'Killings per 10,000 population by parish',
x = '',
y = 'killings per 10,000 population') +
scale_x_continuous(breaks = seq(2018,2024,4),
labels = seq(2018,2024,4),
minor_breaks = seq(2017,2024,1) ) +
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
vjust = 1,
size = 8))
This plot shows quite clearly how the distribution of police killings relative to population for Kingston is shifted well above the other parishes. This empirical observation makes it difficult to visualize time variation in police killings in the other parishes without adjusting for the large, positive ‘fixed effect’ for Kingston.
Finally, we can map the results by parish. But how can we incorporate time variation into a map? One approach is to make an animated map that cycles through the years, using the gganimate package. This package “extends the grammar of graphics as implemented by ggplot2 to include the description of animation.”
First, let’s create a static map. We need a shapefile that includes the geometry with parish boundaries. To obtain the shapefile we can try googling, or consider using an R package that has built in shapefiles. One such package is rnaturalearth.
jm_parish <- rnaturalearth::ne_states("Jamaica", returnclass = "sf")
class(jm_parish)
## [1] "sf" "data.frame"
Note the classes of the jm_parish object: it’s a data frame with the additional class sf. In other words, it’s a simple features object that allows for a single column to store all of the geometry: an sfc_MULTIPOLYGON. Let’s do a bit of cleaning to make the parish names identical to the scraped data, and then inspect the resulting data frame.
jm_parish <- jm_parish %>%
mutate(parish = tolower(name)) %>%
mutate(parish = gsub("saint", "st", parish)) %>%
dplyr::select(geometry, parish)
head(jm_parish)
str(jm_parish)
## Classes 'sf' and 'data.frame': 14 obs. of 2 variables:
## $ parish : chr "st ann" "trelawny" "st james" "hanover" ...
## $ geometry:sfc_MULTIPOLYGON of length 14; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:70, 1:2] -77.5 -77.5 -77.4 -77.4 -77.4 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
## ..- attr(*, "names")= chr "parish"
Now that we have a data frame with the geometry for parish boundaries, we can make a base map of Jamaica’s parishes using the geom_sf() layer. Note how simple it is to use geom_sf(): because the data frame we are working with is an sf object, geom_sf() knows which column holds the geometry and we don’t even need to specify any aesthetic mappings!
jm_parish %>% ggplot() + geom_sf()
Now that we have a base map, let’s join jm_parish to the scraped data so that we can map information on police killings by parish. This is why it was critical to ensure that both data frames include a column that identifies the parish with the exact same spelling and capitalization.
panel_long_map <- jm_parish %>%
right_join(panel_full, by = "parish")
panel_long_map
Now we can plot killings per capita in, say, 2024, by specifying the fill aesthetic as the column we want to map, killings_pc. We’ll also expand our ggplot() call to make the visualization look nicer with a clear legend and labelling.
panel_long_map %>%
filter(year == 2024) %>%
ggplot() +
geom_sf(aes(fill = killings_pc)) +
theme(plot.title=element_text(size = 12),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_gradient(limits = c(0, 6),
breaks = c(0, 2, 4, 6),
low = "white", high = "purple", na.value="grey80") +
labs(title = 'Police killings per 10,000 population',
fill='')
This looks pretty good, but note how Kingston is an extreme value with a very high number of killings relative to population. As a result, the other 13 parishes are clustered near the bottom of the fill range, i.e. lightly shaded. Instead, we can try plotting killings per 10,000 indexed to 2017 levels.
basemap <- panel_long_map %>%
filter(year == 2024) %>%
ggplot() +
geom_sf(aes(fill = kpci)) +
theme(plot.title=element_text(size = 12),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_gradient(limits = c(0, 1100),
breaks = c(0, 250, 500, 750, 1000),
low = "white", high = "purple", na.value="grey80") +
labs(title = 'Police killings indexed to 2017 levels',
fill='')
basemap
Now it’s the parish of Manchester that stands out: the number of police killings there increased from 1 to 10, so the 2024 figure is 1000% of the value in 2017! In this plot Kingston is shaded white, as the parish went from 53 to 32 killings, only about 60% of the 2017 level.
Finally, we can animate a map of indexed police killings per capita over time using gganimate. First let’s create a base map similar to the 2024 map but without filtering for a specific year. We’ll just filter out observations for the year 2017, as the index value for the base year is 100 in all parishes.
basemap <- panel_long_map %>%
filter(year > 2017) %>%
ggplot() +
geom_sf(aes(fill = kpci)) +
theme(plot.title=element_text(size = 12),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_gradient(limits = c(0, 1100),
breaks = c(0, 250, 500, 750, 1000),
low = "white", high = "purple", na.value="grey80") +
labs(title = 'Police killings indexed to 2017 levels',
fill='')
basemap
gganimate allows for transitions to animate changes in the data over time or between different states. We would like an animation that transitions between different years. The transition_states() function creates smooth transitions between discrete states, i.e. between years. Additionally, we can use geom_text() to display the corresponding year on the plot space.
basemap +
transition_states(states = year) +
geom_text(aes(y = 17.8, x = -78.2, label = as.character(year)),
check_overlap = TRUE,
size = 6)
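To render the animation with explicit settings and save it to disk, gganimate provides animate() and anim_save(); a sketch (the frame counts and the filename are illustrative):

```r
anim <- basemap +
  transition_states(states = year) +
  geom_text(aes(y = 17.8, x = -78.2, label = as.character(year)),
            check_overlap = TRUE, size = 6)
animate(anim, nframes = 100, fps = 10)    # render the animation
anim_save("killings_index_by_parish.gif") # saves the last rendered animation
```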
Note how this map doesn’t show where killings relative to population are high; rather, it shows where killings have increased most relative to 2017. Alternatively, we can stick with a map of killings per capita to see where and when killings relative to population are high compared to other parishes. The downside is that the variation between Kingston and the other parishes is so large that it is difficult to visualize variation over time in any single parish.
basemap <- panel_long_map %>%
ggplot() +
geom_sf(aes(fill = killings_pc)) +
theme(plot.title=element_text(size = 12),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_fill_gradient(low = "white", high = "purple", na.value="grey80") +
labs(title = 'Police killings per 10,000 population',
fill='')
basemap +
transition_states(states = year) +
geom_text(aes(y = 17.8, x = -78.2, label = as.character(year)),
check_overlap = TRUE,
size = 6)
To summarize, we’ve tried to map the distribution of killings relative to population using two approaches: (1) killings per 10,000 population indexed to 2017 levels, which highlights within parish variation over time; and (2) killings per 10,000 population, which highlights the large variation between Kingston and other parishes. Both approaches are imperfect, in this case the most effective visualization may be the faceted plots for each parish.
For more information on gganimate, here’s a short tutorial.